1,409 research outputs found
Extended Bit-Plane Compression for Convolutional Neural Network Accelerators
After the tremendous success of convolutional neural networks in image
classification, object detection, speech recognition, etc., there is now rising
demand for deployment of these compute-intensive ML models on tightly power
constrained embedded and mobile systems at low cost as well as for pushing the
throughput in data centers. This has triggered a wave of research towards
specialized hardware accelerators. Their performance is often constrained by
I/O bandwidth and the energy consumption is dominated by I/O transfers to
off-chip memory. We introduce and evaluate a novel, hardware-friendly
compression scheme for the feature maps present within convolutional neural
networks. We show that an average compression ratio of 4.4x relative to
uncompressed data and a gain of 60% over existing method can be achieved for
ResNet-34 with a compression block requiring <300 bit of sequential cells and
minimal combinational logic
CAS-CNN: A Deep Convolutional Neural Network for Image Compression Artifact Suppression
Lossy image compression algorithms are pervasively used to reduce the size of
images transmitted over the web and recorded on data storage media. However, we
pay for their high compression rate with visual artifacts degrading the user
experience. Deep convolutional neural networks have become a widespread tool to
address high-level computer vision tasks very successfully. Recently, they have
found their way into the areas of low-level computer vision and image
processing to solve regression problems mostly with relatively shallow
networks.
We present a novel 12-layer deep convolutional network for image compression
artifact suppression with hierarchical skip connections and a multi-scale loss
function. We achieve a boost of up to 1.79 dB in PSNR over ordinary JPEG and an
improvement of up to 0.36 dB over the best previous ConvNet result. We show
that a network trained for a specific quality factor (QF) is resilient to the
QF used to compress the input image - a single network trained for QF 60
provides a PSNR gain of more than 1.5 dB over the wide QF range from 40 to 76.Comment: 8 page
Slotted ALOHA Overlay on LoRaWAN: a Distributed Synchronization Approach
LoRaWAN is one of the most promising standards for IoT applications.
Nevertheless, the high density of end-devices expected for each gateway, the
absence of an effective synchronization scheme between gateway and end-devices,
challenge the scalability of these networks. In this article, we propose to
regulate the communication of LoRaWAN networks using a Slotted-ALOHA (S-ALOHA)
instead of the classic ALOHA approach used by LoRa. The implementation is an
overlay on top of the standard LoRaWAN; thus no modification in pre-existing
LoRaWAN firmware and libraries is necessary. Our method is based on a novel
distributed synchronization service that is suitable for low-cost IoT
end-nodes. S-ALOHA supported by our synchronization service significantly
improves the performance of traditional LoRaWAN networks regarding packet loss
rate and network throughput.Comment: 4 pages, 8 figure
XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference
Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to
conventional deep neural networks at a fraction of the cost in terms of memory
and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully
digital configurable hardware accelerator IP for BNNs, integrated within a
microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid
SRAM / standard cell memory. The XNE is able to fully compute convolutional and
dense layers in autonomy or in cooperation with the core in the MCU to realize
more complex behaviors. We show post-synthesis results in 65nm and 22nm
technology for the XNE IP and post-layout results in 22nm for the full MCU
indicating that this system can drop the energy cost per binary operation to
21.6fJ per operation at 0.4V, and at the same time is flexible and performant
enough to execute state-of-the-art BNN topologies such as ResNet-34 in less
than 2.2mJ per frame at 8.9 fps.Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation
at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design
of Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special issu
- …